Topological correctness #54
Conversation
fixtures/aa.gfa
Outdated
S 1 A SN:Z:123 SO:i:0 SR:i:0
S 2 A SN:Z:123 SO:i:0 SR:i:0
L 1 + 2 + *
P 124 1+,2+ 4M
I'm not sure what the parser would do, but this should be 1M (1 match).
Why? Shouldn't it be 0M for overlap of length 0 if it's representing the sequence AA?
Yeah, it should be. I knew 4M was way too much :)
Fixed
src/imports/fasta.rs
Outdated
@@ -57,13 +64,16 @@ pub fn import_fasta(
        .sequence(&sequence)
        .save(conn)
};
let node = Node::create(conn, &seq.hash);
So if I import the same file twice, we'll have duplicate nodes + edges made?
Yeah. It's not great, but I don't see it as blocking. Properly detecting duplicates in general would take some work; it's worth doing long term.
Is it that much of a problem on the fasta import side? When would people import the same fasta file and expect it to collapse onto the same nodes and edges? I can understand that on the VCF import side you'd eventually want a heuristic that says something like: for a pair of new edges (p, q), if there is already a pair of edges (a, b) with a shared node id (from the reference path) and identical coordinates on all sides, then check whether the new node has the same hash as the old node (see the sketch below).
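A rough sketch of that heuristic, to make the shape concrete. This is hypothetical: the `Edge` fields, the `node_hash` lookup, and the pairing of flanking edges are stand-ins, not the project's actual schema.

```rust
// Hypothetical sketch of the dedup heuristic described above; the Edge
// fields and node_hash lookup are assumptions, not the real schema.
struct Edge {
    from_node_id: i32,
    from_coordinate: i64,
    to_node_id: i32,
    to_coordinate: i64,
}

/// (p, q) are the proposed flanking edges for a newly inserted node.
/// Returns true if some existing pair (a, b) already attaches to the
/// reference path at the same coordinates on both sides and its inner
/// node carries the same sequence hash, i.e. this insertion duplicates
/// an earlier one.
fn is_duplicate_insertion(
    p: &Edge,
    q: &Edge,
    existing_pairs: &[(Edge, Edge)],
    node_hash: impl Fn(i32) -> String,
    new_node_hash: &str,
) -> bool {
    existing_pairs.iter().any(|(a, b)| {
        a.from_node_id == p.from_node_id
            && a.from_coordinate == p.from_coordinate
            && b.to_node_id == q.to_node_id
            && b.to_coordinate == q.to_coordinate
            && node_hash(a.to_node_id) == new_node_hash
    })
}
```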
Probably not an issue for fasta, but definitely for VCF and GFA, since GFA doesn't do anything to canonically label segments and links.
Force-pushed from a930c4e to 283dad8
migrations/core/01-initial/up.sql
Outdated
INSERT INTO sequence (hash, sequence_type, sequence, name, file_path, "length") values ("start-node-yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy", "OTHER", "start-node-yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy", "", "", 64), ("end-node-zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz", "OTHER", "end-node-zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz", "", "", 64);
INSERT INTO nodes (id, sequence_hash) values (1, "start-node-yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy");
INSERT INTO nodes (id, sequence_hash) values (2, "end-node-zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz");
UPDATE SQLITE_SEQUENCE SET seq = 2 WHERE name = 'nodes';
is this still needed?
I guess not, so that's nice. Removed.
src/models/node.rs
Outdated
sequence_hash
);
let _ = conn.execute(&insert_statement, ());
conn.last_insert_rowid() as i32
I have a bad feeling about using this instead of `RETURNING id`. It seems like we're opening ourselves up to race conditions around this connection session and insertion execution order.
My bad, I let autopilot write a lot of this code and I zoned out on this. Changed to use RETURNING (sketch below).
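For reference, a minimal sketch of the RETURNING version, assuming the rusqlite API already used in this file (`conn.execute`, `conn.last_insert_rowid()` imply rusqlite) and SQLite ≥ 3.35 for RETURNING:

```rust
use rusqlite::{params, Connection};

// Sketch only: the id comes back from this exact INSERT, so it can't be
// confused with another insert on the same connection the way
// last_insert_rowid() can.
pub fn create(conn: &Connection, sequence_hash: &str) -> rusqlite::Result<i32> {
    conn.query_row(
        "INSERT INTO nodes (sequence_hash) VALUES (?1) RETURNING id",
        params![sequence_hash],
        |row| row.get(0),
    )
}
```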
Biggest concern is last_insert_rowid biting us in the ass if we add concurrency eventually. I don't feel good about getting the id by an is-probably-right mechanism vs. one we know to be right.
Eventually, we'll need to see how to make node creation stateless wrt uniqueness. I think using from_node_id + alt seq + to_node_id may accomplish that. The thought is that we should expect multiple VCFs to be loaded over time, with overlap between the files, and we shouldn't be duplicating those entries every time. I think that would also dedupe fastas (hashing on from node (source), seq, to node (sink), and name may get it done; maybe we need collection too). A sketch is below.
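One possible shape for that, as a hedged sketch: derive a stable key from (from node, alt seq, to node) and let a UNIQUE index absorb re-imports. The `node_key` column, the index, and this helper are all hypothetical; none of them exist in the current schema.

```rust
use rusqlite::{params, Connection};

// Hypothetical: a stable key over (from node, alt seq, to node). Any
// stable digest would do; plain concatenation keeps the sketch short.
fn node_key(from_hash: &str, seq_hash: &str, to_hash: &str) -> String {
    format!("{from_hash}|{seq_hash}|{to_hash}")
}

// Assumes a hypothetical schema addition:
//   ALTER TABLE nodes ADD COLUMN node_key TEXT;
//   CREATE UNIQUE INDEX node_key_idx ON nodes (node_key);
pub fn create_unique(
    conn: &Connection,
    from_hash: &str,
    seq_hash: &str,
    to_hash: &str,
) -> rusqlite::Result<i32> {
    let key = node_key(from_hash, seq_hash, to_hash);
    // INSERT OR IGNORE makes a second import of the same VCF/fasta a
    // no-op instead of a duplicate row.
    conn.execute(
        "INSERT OR IGNORE INTO nodes (sequence_hash, node_key) VALUES (?1, ?2)",
        params![seq_hash, key],
    )?;
    // Look the id up by key, so we get the existing row on a re-import.
    conn.query_row(
        "SELECT id FROM nodes WHERE node_key = ?1",
        params![key],
        |row| row.get(0),
    )
}
```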
Force-pushed from 283dad8 to cbf842a
Definitely agree that we need the deduping. I'm thinking about it more and am open to your suggestion about from_node_id + alt seq + to_node_id, but I need to convince myself that that's all that's needed. I don't have a specific example in mind, but I feel like duplicates could pop up more generally, and maybe we need something stronger to handle the general case.